WORLD INTELLECTUAL PROPERTY ORGANIZATION 
International Bureau 




PCT 

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(51) International Patent Classification 6: G06F 9/46    A1



(11) International Publication Number: WO 98/03910 

(43) International Publication Date: 29 January 1998 (29.01.98) 



(21) International Application Number: PCT/GB97/02006

(22) International Filing Date: 24 July 1997 (24.07.97) 



(30) Priority Data:

    9615532.0    24 July 1996 (24.07.96)       GB
    9621947.2    22 October 1996 (22.10.96)    GB



(71) Applicant (for all designated States except US): HEWLETT-PACKARD
COMPANY [US/US]; 3000 Hanover Street, Palo Alto, CA 94304 (US).

(72) Inventors; and

(75) Inventors/Applicants (for US only): CARROLL, Jeremy, John
[GB/IT]; Via Ernesto Rossi, 65, I-57125 Livorno (IT).
BORSHCHEV, Andrei Vladilenovich [RU/RU]; Socialisticheskaya,
4-33, St. Petersburg, 191002 (RU).

(74) Agents: LAINE, Simon, James et al.; Wynne-Jones, Lainé &
James, 22 Rodney Road, Cheltenham, Gloucestershire GL50
1JJ (GB).



(81) Designated States: JP, US, European patent (AT, BE, CH, DE,
DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE).



Published 

With international search report. 



(54) Title: ORDERED MESSAGE RECEPTION IN A DISTRIBUTED DATA PROCESSING SYSTEM 




(57) Abstract 



A complex computing system has a plurality of nodes interconnected by channels through which data messages are exchanged. The
underlying principle is that after arrival at a node of a message, delivery of that message is delayed until after delivery and consequences of
all more senior messages which affect the node. The messages are progressively timestamped at each node so that each timestamp contains
generation by generation indicators of the origin of the associated message. The seniority of that message is uniquely determined thereby
and total ordering of the messages can be achieved. When comparing timestamps for such ordering, comparison of respective generation
indicators is necessary only until there is a distinction.







WO 98/03910 






PCT/GB97/02006 



ORDERED MESSAGE RECEPTION IN A DISTRIBUTED DATA PROCESSING SYSTEM 

This invention relates to complex computing systems.
It was developed primarily to answer a problem with distributed
systems, but it has been realised that it is equally
applicable to systems which are not normally considered to
be distributed, such as a multi-processor computer.
Although their physical separation may be negligible,
the processors are nonetheless distinct and form a "distributed"
system within the computer to which this invention is
applicable.

A landmark paper on distributed systems is that of 
Lamport ("Time, Clocks and the Ordering of Events in a 
Distributed System" - Communications of the ACM Vol. 21 No. 
7, 1978 pp 558-565). In that, a distributed system is 
defined as a collection of distinct processes which are 
spatially separated and which communicate with one another 
by exchanging messages, and in which the message trans- 
mission delay is not negligible compared to the time between 
events in a single process. In such a system, it is 
sometimes impossible to say that one of two events occurred 
first. Lamport proposed a logical clock to achieve a 
partial ordering of all the events, and he postulated a 
single integer timestamp on each message, corresponding to
the time the message was sent. 

Fidge (in "Logical Time in Distributed Computing
Systems" - IEEE Computer 24(8) August 1991 pp 28-33) argued
that the timestamps of Lamport clocks (totally ordered
logical clocks) impose on unrelated concurrent events an
arbitrary ordering, which the observer cannot distinguish
from genuine causal relationships. He proposed partially
ordered time readings and timestamping rules which enable a
causal relationship between two events to be established.
Their order could then be determined. But where there is no
causal relationship between events, no definitive order
exists, and different total orderings of events (or
interleavings) are possible. This means that some messages
are assigned an arbitrary order.

This ordering problem is known as the "race condition 
problem" and it can be illustrated by a simple analogy. A 
dictates a first message to secretary B, who faxes the 
typed version to C. A telephones C with a second message.
Unless ordered, the communication system will not know 
whether the first or second message reached C first, 
although it will know that the dictation preceded the fax. 

It is the aim of this invention to resolve this problem 
and to allow someone to programme a distributed system as if
he was programming a uni-processor. In other words, he can
think about time linearly and he will not have to be 
concerned about concurrency or the race condition problem. 

According to one aspect of the present invention there
is provided a complex computing system comprising a plurality
of nodes connected to each other by channels along which
timestamped data messages are sent and received, each
timestamp being indicative, generation by generation, of its
seniority acquired through its ancestors' arrival in the
system and in any upstream nodes, and each node comprising:
means for storing each input data message,
means for determining the seniority of input data messages
by progressive comparison of respective generations in the
timestamps until the first distinction exists,
means for delivering these messages for processing,
means for applying a timestamp to each output message
derived from such processing comprising the immediately
ancestral message's timestamp augmented by a new generation
seniority indicator consistent with the ordering, and
means for outputting such ordered and timestamped messages.

The delivery means will generally be arranged to 
deliver messages in order according to which message has the 
most senior timestamp indicator. 

For a data message received from outside the system the 
initial timestamp indicator will preferably include an 
indication of the time of receipt of said data message at 
the node, while for a data message generated by a node of 
the system the new generation seniority indicator of the 
timestamp will preferably include an indication of the place 
of said data message in the ordered sequence of such 
messages at said node. This indication may be real time or
logical time.

Conveniently, monotonic integers are utilised as said 
generation seniority indicators in the timestamps. 

Advantageously, the delivery means of a node delivers 
data messages only either once a message has been received 
on each of the input channels of said node or when at least
one data message received on each of the input channels of
said node is stored in the storage means. 

Preferably each node will be adapted to perform at 
least one channel flushing routine triggerable by lack or 
paucity of channel traffic. 

Ideally, all data messages caused by a first data 
message anywhere in the system will be delivered to a node 
before any messages caused by a second data message, junior 
to the first data message, are delivered to said node. 

According to another aspect of the present invention 
there is provided a method of ordering data messages within 
a complex computing system comprising a plurality of nodes 
connected to each other by channels along which data 
messages are sent and received, the method comprising, for 
each node, timestamping each message on arrival, queuing
messages until a message has been received on each input 
channel to the node, and delivering the queued messages for 
processing sequentially in accordance with their timestamps, 
the message having the most senior timestamp being delivered 
first, wherein the timestamping at each node is cumulative 
so that the timestamp of a particular message indicates the 
seniority acquired by that message, generation by gener- 
ation, and wherein the seniority of one message against 
another is determined by the progressive comparison of 
respective generations in the timestamps until the first 
distinction exists. 

According to a further aspect of the present invention 
there is provided a complex computing system comprising a
plurality of nodes between which data messages are
exchanged, wherein after the arrival at a node of a message, 
delivery of the message by the node is delayed until after 
the delivery and consequences of all more senior messages 
which affect the node. 

Such a system may be either a distributed computing 
system, a symmetric multi-processor computer, or a massively
parallel processor computer. 

Assumptions 

To understand later explanations, certain assumptions 
about a distributed computing system will be set out. 

Such a system is a set of nodes or processes connected 
by FIFO (first in, first out) channels. Conventionally, 
'nodes' refer to the hardware and 'processes' to the 
software and operations performed at the nodes, but the 
terms may be used interchangeably here. Some of these 
processes have external channels through which they communi- 
cate with the system's environment, the whole system being 
driven by input messages through some external channels, and 
sending out an arbitrary number of consequential output 
messages through other external channels. 

Each process can be regarded as an application layer
and a presentation layer, which handle the following events:

(a) Message arrival (at the presentation layer)

(b) Message delivery (from presentation to application
layer)

(c) Message send request (from application to
presentation layer)

(d) Message send (from presentation layer)

(e) Message processing complete (from application to
presentation layer).

At any event (b), the application layer

i) generates one or more events (c),

ii) changes the process state, and

iii) generates event (e), which indicates that it is
ready to receive a further message.

A set of such events will be termed a message handler
invocation. Such invocations are the basic building blocks
or atomic units of a distributed system, and a process
history is a sequence of such invocations. Each invocation
may affect subsequent invocations by changing the internal
state of the process.

At the application layer, the channels are simplex.
However, auxiliary messages, from one presentation layer to
the other, are allowed in both directions.

There will be a global real-time clock, accessible from
anywhere in the system. It is required only to be locally
monotonic increasing, and there will be some bound on the
difference between two simultaneous readings on the clock in
different processes.

The FIFO channels are static.

Processes do not generate messages spontaneously.

Each message in the system has exactly one destination.

(These last three assumptions are working hypotheses
which will be relaxed later).

Finally, for initial consideration, there are no loops
in the possible dataflows. This will be discussed further
below.

The time model 

The aim is to achieve a total ordering of the set of
messages in the system. If messages are delivered to every
process in time order and messages are sent along every
channel in time order then the system is said to obey "the
time model". A total ordering of the set of messages is
equivalent to an injective mapping from the set of messages
in the system to a totally ordered set (the time-line).

In this specification "<" will signify, in the relationship
m_p < m_q, that message m_p precedes message m_q in the
total order. This gives the first principle of the time
model: there is a unique time for everything. Such a time
is simply a label useful for evaluating the time order
relationship between messages, and does not have any necessary
relationship with real time.

The relation "<" is based on two partial order relations,
"=>" (sent before) and "->" (strong causality), as
explained below.

For any two external input messages, m_0 and m_1, either
m_0 => m_1 or m_1 => m_0. This is given by the environment,
typically by the clock time of message arrival. In other
words, external input messages are totally ordered with
respect to "=>".

If message send requests for messages m_0 and m_2 occur
during the same invocation at the behest of some third
message, the send request for m_0 being before m_2, then
m_0 => m_2.

"->" is the least partial order such that if the message
send request for m_1 occurs during the invocation in response
to m_0, then m_0 -> m_1 (i.e. a message strongly causes any
messages sent by its handler).

The total order relation "<" is determined by the
following axioms:

If m_0 => m_1 then m_0 < m_1.
If m_0 -> m_1 then m_0 < m_1.
If m_0 => m_2 and m_0 -> m'_0 then m'_0 < m_2.

The first two axioms correspond to Lamport's axioms;
the third, the strong causality axiom, is the heart of the
time model of the present proposal.

The idea behind the strong causality axiom is the
following: if a process or the system's environment sends
two messages (m_0 and m_2), one after another, then any
consequence (m'_0) of the first message (m_0) should happen
before the second message (m_2) and any of its consequences.
This gives the second principle of the time model:
there is enough time for everything (i.e. enough time for
all remote consequences to happen before the next local
event).

For a better understanding of the invention, reference
will now be made, by way of example, to the
accompanying drawings, in which:

Figure 1 is a diagram illustrating the total ordering
of messages,

Figure 2 is a diagram of a process with its time
service,

Figure 3 illustrates a message sequence of the time
service,

Figure 4 shows a network of processes, to explain
channel flushing,

Figure 5 is a diagram showing channel flushing messages
and the structure of the time service,

Figure 6 illustrates a message sequence of channel
flushing,

Figure 7 shows a bus,

Figure 8 shows a delay,

Figure 9 shows a false loop, and

Figure 10 comprises diagrams of feedback through a bus.

Referring to Figure 1, the diagram can be likened to a
tree on its side with its root (to the left) representing
the system's environment which generates external messages.
The nodes (the vertical lines) are message handler
invocations and the arrowed horizontal lines represent
messages.

Using this tree it is easy to reconstruct the message
relations. For example, a -> b because b was sent while a was
handled; b => c because b was sent before c in the same
invocation. Also, a -> f and x => z as these relations are
transitive. f and e are incomparable under both "=>" and
"->"; nevertheless f < e. To compare two messages with respect to
the total order relation "<" one has to trace paths from the
root to these messages. There can be three possible cases,
which correspond to the three axioms. They are shown by the
following three examples taken from Figure 1:

c lies on the path of d, hence c -> d and therefore
c < d.

x and z have the same path, but x is sent before z, so
x => z and x < z.

f and e have the same path prefix, but then their paths
fork, and b (with b -> f) is sent before c (with c -> e),
which means b => c, so f < c < e, giving f < e.

If a distributed system follows this time model, i.e.
if messages are delivered to each process in this order, and
sent down each channel in this order, then the system's
behaviour will be deterministic, independent of the speed of
processes and channels.

A possible time-line from which each message can be
given a unique time is the set of sequences of integers.
These are ordered using the standard dictionary ordering.

The path from the root to a message fully identifies
the message position in the total order relation "<". This
path can be codified as a representation of time in the
distributed system. The names of processes along the path
are immaterial; the only information needed is the relative
order of the message ancestors at each process and the order
of the initial external messages. So, in Figure 1, the time
can be represented by an array of integers, e.g. [1,2,3,1,2]
for e, [2,3] for z, [2,2,2,1,1] for y. However, since the
system's environment may be distributed, it could be
difficult to assign unique integers to each external
message. A possible solution is to use real clock values
combined with an external input identifier, this requiring
that at every external input all real clock readings are
unique and grow monotonically.

The following C++ class can be used for time:

    #include <algorithm>   // std::min

    // MAX_PATH is an upper bound on path length, defined elsewhere.
    class TTime {
    public:
        TTime():
            RealClock( 0.0 ),
            Input( 0 ),
            Length( 0 ) {}
        TTime( float realclock, unsigned input ):
            RealClock( realclock ),
            Input( input ),
            Length( 0 ) {}
        // A new generation is started when a message is delivered.
        void AddNewProcess()
            { Path[ Length++ ] = 0; }
        // The newest generation is advanced after each output message.
        void operator++() { Path[ Length-1 ]++; }
        friend bool operator<( TTime t0, TTime t1 );
    private:
        float RealClock;
        unsigned Input;
        unsigned Path[ MAX_PATH ];
        unsigned Length;
    };

    // Generations are compared in turn until the first distinction.
    bool operator<( TTime t0, TTime t1 ) {
        if( t0.RealClock < t1.RealClock ) return true;
        if( t1.RealClock < t0.RealClock ) return false;
        if( t0.Input < t1.Input ) return true;
        if( t1.Input < t0.Input ) return false;
        for( unsigned i=0; i<std::min( t0.Length, t1.Length ); i++ ) {
            if( t0.Path[i] < t1.Path[i] ) return true;
            if( t1.Path[i] < t0.Path[i] ) return false;
        }
        return t0.Length < t1.Length;
    }



The more processes handle a message, the longer its 
path grows. Potentially, if there is a cycle in the system, 
paths can become arbitrarily long. 

The implementation should use (when possible) dynamic
allocation to avoid the arbitrary upper limit on path
length.
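
As an illustration of the dynamic allocation suggested above, the following is a minimal sketch of a timestamp whose path grows on demand. The names DynTime, addNewProcess and increment are hypothetical and do not appear in the specification; the sketch assumes that std::vector's built-in lexicographic comparison is an acceptable stand-in for the generation-by-generation loop of TTime, since it also treats a proper prefix as the more senior value.

```cpp
#include <cassert>
#include <vector>

// Hypothetical variant of TTime with an unbounded path (not from the source).
struct DynTime {
    double realClock = 0.0;       // clock reading taken at the external input
    unsigned input = 0;           // identifier of the external input
    std::vector<unsigned> path;   // one entry per process that handled an ancestor

    // Called on delivery: open a new generation for the receiving process.
    void addNewProcess() { path.push_back(0); }

    // Called after each output message; requires a non-empty path.
    void increment() { ++path.back(); }
};

// Compare clock, then input, then the path generation by generation.
// std::vector's operator< is lexicographic, so comparison stops at the
// first distinction and a proper prefix orders before its extensions.
inline bool operator<(const DynTime& a, const DynTime& b) {
    if (a.realClock != b.realClock) return a.realClock < b.realClock;
    if (a.input != b.input) return a.input < b.input;
    return a.path < b.path;
}
```

With this representation a cycle in the system merely lengthens the vector; no fixed bound such as MAX_PATH has to be chosen in advance.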

Each node or message handler invocation in the distributed
system is structured as shown in Figure 2. All
functionality related to support for the time model resides
in the time service, so that a process does not know
anything about the time model. Each time it finishes
handling a message it informs its time service (Done signal
or event (e) above). The time service has a local clock T of
type TTime which is updated whenever a message is received
or sent by its process. Initially the local clock has a
value given by a default constructor and it is kept and used
even while the process is idle.

The "timestamp assigner" at the border with the 
system's environment has a real clock synchronized with the 
clocks of all other timestamp assigners, and a unique input 
identifier. 'Synchronized' here is understood to mean 
adequately synchronized, for example by means of the network 
time protocol as described in the Internet Engineering Task 
Force's Network Working Group's Request for Comments 1305 
entitled 'Network Time Protocol (Version 3) Specification, 
Implementation and Analysis' by David L Mills of the Univ. 
of Delaware published by the IETF in March 1992. Each time 
an external message enters the system it gets a unique 
timestamp constructed from these two values (see the second 
constructor of TTime). It is assumed that the real clock
progresses between each two messages. 
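
A timestamp assigner of this kind might be sketched as follows. The names ExternalStamp and TimestampAssigner are hypothetical, and the clock reading is passed in by the caller rather than read from any particular clock API. The specification simply assumes that the clock progresses between messages; stepping the reading forward by one tick when it has not advanced is an assumption of this sketch, added only to keep the stamps unique.

```cpp
#include <cassert>

// Hypothetical external timestamp: (clock reading, input identifier).
struct ExternalStamp {
    long long clock;   // real clock reading, in any fixed unit
    unsigned input;    // unique identifier of the external input
};

// Order by clock first and by input identifier second, as TTime does.
inline bool operator<(const ExternalStamp& a, const ExternalStamp& b) {
    if (a.clock != b.clock) return a.clock < b.clock;
    return a.input < b.input;
}

class TimestampAssigner {
public:
    explicit TimestampAssigner(unsigned inputId) : input_(inputId) {}

    // Stamp one external message with the supplied clock reading.
    ExternalStamp stamp(long long reading) {
        // Assumption of this sketch: if the clock has not advanced since
        // the previous message, use the next tick so stamps stay unique.
        if (hasLast_ && reading <= last_) reading = last_ + 1;
        last_ = reading;
        hasLast_ = true;
        return ExternalStamp{reading, input_};
    }

private:
    unsigned input_;
    long long last_ = 0;
    bool hasLast_ = false;
};
```

Two assigners with different input identifiers can produce the same clock reading and still yield distinct, totally ordered stamps.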

Input messages are not delivered to the process until
there are messages present on all inputs. Once this condition
holds, the local clock is set to the timestamp of the
most senior message, the new process is added to the path,
and the most senior message is delivered. Every output
message sent by the process while handling this input is
timestamped by the time service with the current value of
the local clock; and then the clock is incremented. The next
message can be delivered only after the process explicitly
notifies the time service that it has finished with the
previous one, is idle and waiting (Done). The corresponding
message sequence is shown in Figure 3. The basic algorithm
Alg.1 of the time service is shown in the table below.



Initial state: Idle.

State Idle

  Event:  Input message arrives
  Action:
    if( there are messages on all inputs )      // If all inputs are non-empty,
        DeliverTheOldestMessage();              // delivery is possible.

State Handling Message

  Event:  Process sends output message
  Action:
    Send it with the timestamp T;
    T++;                                        // increment the last time in the timestamp

  Event:  Done
  Action:
    if( there are messages on all inputs ) {    // If all inputs are still non-empty,
        DeliverTheOldestMessage();              // deliver the next oldest message.
        return;
    }
    Next state = Idle;

Functions

    void DeliverTheOldestMessage() {
        T = timestamp of the oldest message;    // First, the local clock is set to the value of the oldest
        T.AddNewProcess();                      // timestamp, and a new process is added to the path in it.
        Deliver( the oldest message );
        Next state = Handling Message;
    }



This algorithm ensures that input messages are
delivered to each process in the order of their timestamps,
and that output messages are sent by each process in the
order of their timestamps. Thus, the time service as
described above fully implements the time model.
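
The basic algorithm can be sketched in C++ roughly as follows. This is an illustrative sketch only: the names Stamp, Message and TimeService are hypothetical, timestamps are reduced to the path part (a vector of integers compared lexicographically), and the process itself is left out, so that send() merely returns the stamped output message.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

using Stamp = std::vector<unsigned>;   // path-only timestamp (assumption of this sketch)

struct Message {
    Stamp time;
    int payload;
};

class TimeService {
public:
    explicit TimeService(std::size_t inputs) : queues_(inputs) {}

    // Event "input message arrives". Returns true if a delivery happened.
    bool arrive(std::size_t input, Message m) {
        queues_[input].push_back(std::move(m));
        return idle_ && tryDeliver();
    }

    // Event "process sends output message"; only valid while handling.
    Message send(int payload) {
        Message out{clock_, payload};   // stamp with the local clock T
        ++clock_.back();                // then T++
        return out;
    }

    // Event "Done". Returns true if the next message was delivered.
    bool done() {
        idle_ = true;
        return tryDeliver();
    }

    const Message& current() const { return current_; }
    bool idle() const { return idle_; }

private:
    // Deliver the oldest head message, but only if every input is non-empty.
    bool tryDeliver() {
        if (queues_.empty()) return false;
        for (const auto& q : queues_)
            if (q.empty()) return false;
        std::size_t oldest = 0;
        for (std::size_t i = 1; i < queues_.size(); ++i)
            if (queues_[i].front().time < queues_[oldest].front().time) oldest = i;
        current_ = std::move(queues_[oldest].front());
        queues_[oldest].pop_front();
        clock_ = current_.time;   // T = timestamp of the oldest message
        clock_.push_back(0);      // T.AddNewProcess()
        idle_ = false;            // next state = Handling Message
        return true;
    }

    std::vector<std::deque<Message>> queues_;   // one FIFO per input channel
    Stamp clock_;                               // local clock T
    Message current_{};                         // message being handled
    bool idle_ = true;
};
```

Because each channel is FIFO and carries messages in time order, comparing only the head of each queue is sufficient to find the most senior waiting message.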

However, a distributed system containing a cycle will
not work, as all time services in the cycle will always be
missing at least one input message. Also, rare messages,
either on an external input or on an internal process-to-
process connection, may significantly slow down the whole
system.

Channel flushing can solve both these problems. Channel
flushing is a mechanism for ensuring that a message can be
accepted. The principle is to send auxiliary messages that
enable the time service to prove that no message will arrive
which is earlier than one awaiting delivery. Hence the
waiting message can be delivered.

There are two kinds, namely 'sender channel flushing',
in which the sending end initiates channel flushing when the
channel has been left unused for too long, and 'receiver
channel flushing', in which the receiving end initiates
channel flushing when it has an outstanding message that has
been awaiting delivery for too long.

Receiver channel flushing will be considered first, in 
conjunction with Figure 4. For simplicity, timestamps and 
clocks are represented by single integers. 

Suppose for a certain period of time the two lower
inputs of the process C are empty while there is a message
with timestamp 23 waiting on the upper input. The time
service of C wants to deliver the message as soon as
possible, but it cannot do so until it proves that those
messages that will eventually arrive on the empty inputs
will have greater timestamps.

To prove it, C sends a channel flushing request "May I
accept a message of time 23?" to B and F, both of which have
to forward this request deeply into the system, until either
a positive or negative response can be given. In fact, to
verify that C can accept the message with the timestamp 23
in Figure 4, it is enough to ask only the processes shown,
since all inputs to the diagram are at times later than 23.

The algorithm described below is a straightforward 
implementation of channel flushing. All channels in the 
system (which actually connect the time services of the 
processes) are bi-directional, since, besides the normal 
uni-directional messages, the channel flushing messages are 
sent along them in the reverse direction. These messages and 
the structure of the time service are shown in Figure 5. 

The general idea is that each time the time service 
discovers that there are input messages waiting while some 
inputs are empty, it sets a flush timer. On the timeout 
event it starts the channel flush. It sends flush requests 
to all empty inputs, creates a (local) request record with 
the list of these inputs, and then waits for responses. If 
positive responses come from all inputs, the oldest message 
is delivered. If a negative response comes on any input the 
flush is cleared and re-scheduled. 
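
The bookkeeping for one outstanding flush can be sketched as below. The types RequestRecord and Outcome and the functions onYes and onNo are hypothetical names for this illustration; timestamps are single integers, as in the simplified discussion of Figure 4, and the functions only report what the time service should do next instead of sending messages themselves.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical request record: time asked about, return path, inputs still awaited.
struct RequestRecord {
    long t;
    std::vector<int> path;   // empty path means the request is local
    std::set<int> pending;   // inputs that have not yet answered Yes
};

enum class Outcome { StillWaiting, DeliverOldest, ForwardYes, RescheduleFlush, ForwardNo };

// A positive response arrived on 'input'.
inline Outcome onYes(RequestRecord& r, int input) {
    if (r.pending.erase(input) == 0) return Outcome::StillWaiting;   // already answered
    if (!r.pending.empty()) return Outcome::StillWaiting;            // others outstanding
    // All inputs answered Yes: a local flush succeeds, a remote one is relayed.
    return r.path.empty() ? Outcome::DeliverOldest : Outcome::ForwardYes;
}

// A negative response arrived on any input: it is acted on immediately.
inline Outcome onNo(const RequestRecord& r) {
    return r.path.empty() ? Outcome::RescheduleFlush : Outcome::ForwardNo;
}
```

In both terminal cases the caller would delete the record, which is why a single No is enough to end the flush while a Yes must be collected from every empty input.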

Requests from other time services are handled in the
following way. First, the time service tries to reply using
its local information (its local clock and the timestamps of
waiting messages). If it is unable to do so, it creates a
(remote) request record and forwards the request to all
empty inputs. If all responses are positive, so is the one
to the remote requester. Otherwise, the response is No.
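
The attempt to reply from local information might look like the following sketch. The function answerLocally and the enumeration Reply are hypothetical names; timestamps are again single integers, and waitingHeads is assumed to hold the timestamp of the oldest waiting message on each non-empty input.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class Reply { Yes, No, Forward };

// Decide how process 'self' answers a request "Your Next Time?" for time t.
inline Reply answerLocally(int self, long localClock, long t,
                           const std::vector<int>& returnPath,
                           const std::vector<long>& waitingHeads) {
    // The request has made a cycle back to this process, or the local
    // clock is already ahead of t: the answer is Yes.
    bool inPath = std::find(returnPath.begin(), returnPath.end(), self) != returnPath.end();
    if (inPath || t < localClock) return Reply::Yes;

    // A message older than t is waiting on some input: the answer is No.
    for (long head : waitingHeads)
        if (head < t) return Reply::No;

    // Otherwise local information is insufficient; the request must be
    // forwarded to all empty inputs.
    return Reply::Forward;
}
```

Only the Forward case requires a request record to be created; the other two cases are answered at once along the return path.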

The algorithm Alg.2 is presentee in a table below. 
Again, the time service has two states: Idle and Handling 
S Message . While in the latter state, the time service is only 
serving process output messages, whereas in the Idle state 
it does all channel flushing work for both itself and other 
processes. < t, Path, Inputs > represents a request record. 
[] is an empty path. New, Delete and Find are operations 
10 over an array of request records that maintain the local 
state of the receiver channel flush algorithm. P is the 
identifier of this process. T is the local clock. 
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Initial slate: Idle. Flush timer nol set. no request records. 


Evonl 


Action 


Stale Idle 


input message 
with time U 
arrives at input i 


lor( all < I, Path, Inputs > ): Path *[}) 11 First update remote requests. 

Il( I < 1| ) // input i is younger titan the request's lime. 
YesForRequest ( < I, Path, Inputs >, 1 ); 

// input i is older Uian lite request's time. 
NoForRequest ( < 1, Palh, Inputs > ); 
ll{ there are messages on all Inputs ) ( t! Tlteix, if ait inputs arc non-empty, 
CanceILocalFlushing(); 

Del»vcrThe01deslMessage{); // delivery is possible. 
return. 

} 

il( the new message Is the oldest one ) ( // If this message becomes the 
CancelLocalFlushingO; // oldest one, a new local 
Set Hush timer; fl flushing must be scheduled. 
return; 

) 

if( Find( < t, (|, Inputs > ) ) // Otherwise, iftlxerc is a local request wailing 
YesForRequest ( < t, [], Inputs > 1 ); H for this input, that means Yes. 






Flush timeout 


StarlFiushtng( llmestamp ot the oldest message. () J; 

// Empty return palh indicates that lite request is local. 


< Your Next Time?, t, 
lPo...P„)> 


Il( P is among |P„...P«] || l< T ) ( // If request lias made u cycle - assume 
Yes. 

Send to outpul to P„: < Yes, t. [Po...P„J >; // Or, if local clock is alr eady 
"turn; // alieud oft. definitely Yes. 

) 

if( there is a message older than t on some input ) { 

Send to P w : < No, t, 1P 0 ...P.,] >; // 77ie/i assume Ho. 
return; 

) 

// Tliis process is not able to answer immediately and it slatls flushing. 
StarlFlushlng( I. IPo...P n ] ); 




< Yos, t, [Po ..P„P| > 
*ve response on input 


il{ Find( < t, (Pa-Pa), Inputs > ) ) 

YesForRequest ( < I, (P 0 ...Po). Inputs >, i ) 


<No. t, lPo...P*P|> 

neguliuc response 


il( Find( < I. (Po...P„J. Inputs > ) ) 

NoForRequest ( < I. (P 0 ..P«1, Inputs > ) 
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Stale Handling message 



Process sends output 
messugc 



Send il wilh the timeslamp T; 



Done 



if( there ere messages on ail inputs ) { 
DeliverTheOIUestMessage(); 
telurn; 

} 

H( there is a non-empty input ) 

Set flush timer; 
Next stale = Idle; 



// // alt itiputs are still non-empty, 
II deliver the next oldest message. 



II Ollxcrwise, if titer e is a non-empty input, 
II a new local flu slung must be sclieduled. 



Functions 



void DeliverTheOldeslMessage() { 

T = maximum(T. timeslamp of the oldest message ); // Before delivery, live local cluck is set 
T.AddNewProcess(). // to tlic value of lite oldest timeslamp, und 

Deliver( the oldest message ); // a new process is added to the path in llus vutuc. 

Next slate = handling Message. 

) 



void CanceILocalFlushing() ( 
Cancel Hush limer; 
it( Flnd( < l, [), inputs > ) ) 
Delele( < I. |J. Inputs > ); 

) 



// Cancelling local fluslung activity includes 

II cancelling ifie flush timer (fust in case it is set) 

II and deletion of a local request record (if any). 



void StartFlushtny( TTime t. TPath path ) { // Fluslling upon local or remote request starts tvith 

lor( all empty inputs ) // sending requests to all empty inputs 

Send to input: < Your next lime?, t. Path*P >; // (P is added to tlte return path) 

New( < I. Palh. Set ol empty inputs > ); // and creation of a new request record. 

J 



void YesForRequest( TRecord < t, Path, Inputs >, TInput i ) {   // On positive response on input i
    if( i ∉ Inputs )              // If the record has already received this
        return;                   // information then it doesn't care.
    Inputs = Inputs \ i;          // The input is removed from the Inputs set of the request record.
    if( Inputs == ∅ ) {           // If this set becomes empty, no more responses are needed.
        Delete( < t, Path, Inputs > );
        if( Path == [] )                  // and if it is a local request
            DeliverTheOldestMessage();    // the oldest message is delivered.
                                          // Successful channel flushing.
        else                              // Otherwise it is a remote request.
            Send to the last process in the Path: < Yes, t, Path >;   // Yes is sent to the next process
    }                                                                 // in the return path.
}



void NoForRequest( TRecord < t, Path, Inputs > ) {   // Negative response - acted on immediately.
    Delete( < t, Path, Inputs > );                   // The corresponding record is deleted.
    if( Path == [] )                                 // If it is a local request
        // Unsuccessful channel flushing
        Set flush timer;                             // then restart the flush timer.
    else                                             // Otherwise, for a remote request,
        Send to the last process in the Path: < No, t, Path >;   // No is sent to the next process
}                                                                // in the return path.
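
The request-record bookkeeping described by these functions can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation of the specification: the `Node` class, its `sent` log, the `delivered` counter and the `flush_timer_set` flag are hypothetical stand-ins for real message sending, message delivery and timers. A path is modelled as a tuple of process names, with `()` standing for the empty (local) path, and each handler receives the path under which the record was created.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical process holding flush request records keyed by (t, path)."""
    name: str
    records: dict = field(default_factory=dict)  # (t, path) -> inputs still awaited
    sent: list = field(default_factory=list)     # log of (destination, kind, t, path)
    delivered: int = 0                           # stand-in for DeliverTheOldestMessage()
    flush_timer_set: bool = False                # stand-in for "Set flush timer"

    def start_flushing(self, t, path, empty_inputs):
        # Ask every empty input; this node is appended to the return path.
        for inp in empty_inputs:
            self.sent.append((inp, "Your Next Time?", t, path + (self.name,)))
        self.records[(t, path)] = set(empty_inputs)

    def yes_for_request(self, t, path, i):
        awaited = self.records.get((t, path))
        if awaited is None or i not in awaited:
            return                               # unknown record or repeated answer
        awaited.discard(i)
        if not awaited:                          # no more responses are needed
            del self.records[(t, path)]
            if path == ():
                self.delivered += 1              # local request: flush succeeded
            else:
                self.sent.append((path[-1], "Yes", t, path))

    def no_for_request(self, t, path):
        if self.records.pop((t, path), None) is None:
            return                               # record already resolved
        if path == ():
            self.flush_timer_set = True          # local request: try again later
        else:
            self.sent.append((path[-1], "No", t, path))
```

A single No resolves a record at once, whereas Yes must be collected from every awaited input before the record is resolved.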
To illustrate the work of the algorithm, consider the
example in Figure 4 in conjunction with one possible channel
flushing message sequence as shown in Figure 6. "?" denotes
here "Your Next Time?", "Y" stands for Yes, and "N" for No.
Sets near the vertical axes represent the sets of processes
from which responses are still wanted (Inputs in the above
terminology). "ok" means that a process is sure it will not
be sending anything older than the timestamp of the request
(i.e. 23). "loop" means that a process has found itself in
the return path of the request.

It is evident that the channel flushing procedure 
consists of two waves: a fanning out request wave and a 
fanning in response wave. The request wave turns back and 
becomes the response wave as soon as the information needed 
for the initial request is found.

The "timestamp assigner" at the border with the 
environment treats the channel flushing request in the 
following way. 



RealClock() returns the real clock reading; Input is the unique external input identifier.

Event                                Action

External input message arrives       Send it with the timestamp TTime( RealClock(), Input );

< Your Next Time?, t, [P0...Pn] >    if( t < TTime( RealClock(), Input ) )   // If t is older than local time
flush request                            Send < Yes, t, [P0...Pn] >;         // then definitely Yes
                                     else
                                         Send < No, t, [P0...Pn] >;          // Otherwise assume No
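
The timestamp assigner's two rules can be sketched in Python as follows. This is an illustration only. It assumes that a timestamp is a tuple starting with the real clock reading and the input identifier, that timestamps are compared element by element (which is how Python compares tuples), and that input identifiers are mutually comparable. The names `make_assigner`, `stamp` and `answer_flush` are hypothetical.

```python
import time


def make_assigner(input_id, clock=time.time):
    """Build the two handlers of a timestamp assigner for one external input."""

    def stamp():
        # Counterpart of TTime( RealClock(), Input ).
        return (clock(), input_id)

    def answer_flush(t, path):
        # If the requested time t is already older than local time, nothing
        # older than t can still be sent, so answer Yes; otherwise assume No.
        kind = "Yes" if t < stamp() else "No"
        return (kind, t, path)

    return stamp, answer_flush
```

Passing a fixed `clock` function, for example `lambda: 100.0`, makes the behaviour repeatable when experimenting.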
Sender channel flushing is conceptually simpler, and
significantly more efficient than receiver channel flushing,
although it does not provide a solution to the loop problem.
In sender channel flushing, each output channel of each
process has a timeout associated with it. This timeout is
reset each time a message is sent down the channel. The
timeout can be either a logical timeout (i.e. triggered by
some incoming message with a sufficiently later timestamp)
or a physical timeout. If the timeout expires before being
reset then a sender channel flush is initiated down that
channel. The channel flush consists of a 'non-message'
which is sent down the output channel. The receiver can use
it to advance the time of the channel, allowing the time
service to accept earlier messages waiting on other channels.
When the non-message is the next message to be
accepted, the time service simply discards it. However,
the non-message, by advancing the receiver's local clock,
can cause logical timeouts on output channels of the
receiver, hence causing a cascading sender channel flush.

The timestamp assigners also participate in sender
channel flush; they have to use a physical timeout. In
general, using both sender and receiver channel flushes is
recommended, preferably with some sender channel flushes
piggy-backed upon receiver channel flush response messages.
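
How a receiver might use a non-message can be sketched in Python. The sketch makes several simplifying assumptions that are not taken from the specification: timestamps are plain integers, the receiver is a hypothetical `Receiver` class with one queue per input channel, and a non-message is represented by the marker object `NON_MESSAGE`. Timeouts and the cascading of flushes to the receiver's own outputs are left out.

```python
from collections import deque

NON_MESSAGE = object()  # hypothetical marker for a sender channel flush


class Receiver:
    def __init__(self, channels):
        self.queues = {c: deque() for c in channels}
        self.accepted = []  # messages handed on for processing, in order

    def receive(self, channel, timestamp, payload=NON_MESSAGE):
        self.queues[channel].append((timestamp, payload))
        self._drain()

    def _drain(self):
        # The oldest message may be accepted only while every input channel
        # has something queued, since a silent channel could still deliver
        # an older message.
        while self.queues and all(self.queues.values()):
            channel = min(self.queues, key=lambda c: self.queues[c][0][0])
            timestamp, payload = self.queues[channel].popleft()
            if payload is not NON_MESSAGE:
                self.accepted.append((timestamp, payload))
            # A non-message is simply discarded once it is the next one due.
```

With channels `a` and `b`, a message stamped 5 on `a` is held until `b` reports something later, for instance a non-message stamped 9.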

To provide usable middleware implementing the time
model it is necessary to relax some of the more restrictive
assumptions about the system being built. Three special
processes that need to be created and integrated with such
middleware are now considered, as is a full treatment of
loops in the dataflow.
The bus 

The bus is a process that allows multiple output ports
from any number of processes to be connected to multiple
input ports on other processes. The term 'bus' is taken from
Harrison (A Novel Approach to Event Correlation, Hewlett-
Packard Laboratories Report No. HPL-94-68 Bristol UK 1994)
and is intended to convey the multiple access feature of a
hardware bus.

The bus implements a multicast as a sequence of 
unicasts, its operation being shown in Figure 7. 

The output channels are ordered (shown in the diagram
as 1, 2, 3, 4). When a message is delivered by the time
service to any of the input channels, the bus outputs an
identical message on each of its output channels in order.
The time service computes the timestamp for these in the
normal way, as shown.

A bus acts as a message sequencer, ensuring that all
recipients of a series of multicasts receive the messages in
the same order (as shown).
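
The multicast-as-unicasts behaviour can be sketched in Python. The timestamp scheme used here is an assumption made for illustration: each output carries the delivered message's timestamp extended by the position of the output in the bus's ordered channel list. The `Bus` class and its `deliver` method are hypothetical names, and timestamps are modelled as tuples.

```python
class Bus:
    def __init__(self, outputs):
        self.outputs = list(outputs)  # ordered output channels

    def deliver(self, timestamp, data):
        # One unicast per output channel, in channel order. Extending the
        # timestamp by the output position keeps the copies ordered among
        # themselves and ahead of any later multicast.
        return [(out, timestamp + (k,), data)
                for k, out in enumerate(self.outputs)]
```

Under this scheme every copy of an earlier multicast compares as older than every copy of a later one, which is the sequencing property described above.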
The delay 

In a non-distributed system it may be possible to set
a timer, and then have an event handler that is invoked when
the timer runs out. This alarm can be seen as a spontaneous
event. Within the time model, it must be ensured that
spontaneous events have a unique time. The simplest way of
achieving this is to treat spontaneous events just like
external events. A timestamp is allocated to them using a
real clock and a unique input identifier. Moreover, a
process can schedule itself a spontaneous event at some
future time, which again will get a timestamp with its real part
coming from the scheduled time. Having thus enabled the
scheduling of future events, the delay component can be
created as schematically shown in Figure 8.

For each input message the delay generates an output
message at some constant amount of time, δ, later. The time
of the generated message is given by the sum of δ and the
first time in the timestamp (the real time part). The rest
of the path part of the timestamp is ignored. The input
identifier part of the timestamp is changed from the
original, 1, to the input identifier of the delay, 1'.
There are large efficiency gains from fully integrating
delays with the receiver and sender channel flush
algorithms. The responses to flush requests should take the
length of the delay into account, as should any relaying of
flush requests through the delay.
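
The timestamp rule for the delay can be written as a short Python function. It assumes, for illustration only, that a timestamp is a tuple whose first element is the real time part, followed by the input identifier and the path. The function name `delay_timestamp` and its parameters are hypothetical.

```python
def delay_timestamp(timestamp, delta, delay_input_id):
    """Timestamp of the message a delay emits for one input message."""
    real_time = timestamp[0]  # only the real time part is kept
    # The original input identifier and the rest of the path are dropped;
    # the delay substitutes its own input identifier.
    return (real_time + delta, delay_input_id)
```

For example, an input stamped `(23, "in1", 0, 2)` passing through a delay of 5 whose identifier is `"in1'"` yields `(28, "in1'")`.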

The plumber 

The plumber (the topology manager) is a special process
that manages the creation and destruction of channels and
processes. The plumber has two connections with every
process in the system. The first is for the process to make
topology change requests to the plumber; the second is for
the plumber to notify the process of topology changes that
affect it (i.e. new channels created or old channels
deleted). The plumber can create and delete processes that
have no channels attached. The plumber has a specific
minimum delay between receiving a topology change request
and carrying it out. This is the key to a feasible solution to
topology changes within this time model. The reason that
topology changes are difficult for (pessimistic implementations
of) the proposed time model is that for a process to
be able to accept a message it must know that it is the
oldest message that will arrive. If the topology is unknown
then all other processes within the application must be
asked if they might send an older message. This is implausible.
The plumber acts as the single point one needs to ask
about topology changes. Moreover, the minimum delay between
the request for a topology change and its realisation
ensures that the plumber does not need to ask backward to
all other processes. For large systems, or for fault
tolerance, multiple plumbers are needed, and these can be
arranged hierarchically or in a peer-to-peer fashion. As
with the delay process, the plumber needs to be integrated
with the channel flush algorithms.

Loops 

Loops generate issues for the time model and typical
loops generate a need for many auxiliary messages. A loop
without a delay can only permit the processing of a single
message at any one time anywhere within the loop (it is said
that the loop "locksteps"). Three solutions to these
problems are examined.

Removing loops 

The traditional design models, client/server and
master/slave, encourage a control driven view of a distributed
system, which leads to loops. A more data driven view of a
system, like the data flow diagrams encouraged in structured
analysis, is typically less loopy.

Moreover, where a first cut has loops, a more detailed
analysis of a distributed system may show that these loops
are spurious. The data-flows, rather than feeding into one
another, feed from one submodule to another and then out.
For example, in Figure 9, there is an apparent loop between
process A and process B (a flow from A feeds into B which
feeds back into A). But when one looks at the sub-processes
A1, A2, B1, B2 there are, in fact, no loops, only flows.

Co-locate the processes in a loop

If there is a loop for which other solutions are not
appropriate, it will be found that only one process within
the loop can be operational at any one time. It will
normally be better to accept this, and put all the processes
in the loop on the same processor. There will be no penalty
in terms of loss of parallelism. This approach will minimise
the cost of the auxiliary messages, because they will now be
local messages.

Break the loop using a delay
Informally, the problem with a loop is feedback.
Feedback happens when an input message to a process strongly
causes another message (the feedback) to arrive later at the
same process. Under the strong causality axiom, feedback is
strongly caused by the original messages, and hence comes
before all subsequent messages. Hence any process in a loop
must, after processing every message, first ascertain
whether there is any feedback, before proceeding to deal
with any other input. A delay process is a restricted
relaxation of strong causality, since each input to the
delay does not strongly cause the output, but rather
schedules the output to happen later. Hence, if there is a
delay within the loop, then a process can know that any
feedback will not arrive until after the duration of the
delay. Hence it can accept other messages arriving before
the feedback.

A difficult case 

The example in Figure 10 presents specific problems of 
both semantics and implementation for feedback. 

In each of the four cases we see a message with data α
arriving at a bus B and being multicast to processes A and
C. A responds to the message α by outputting a message with
data β which is fed back into the bus, and hence multicast
to A and C. When it arrives at A no further feedback is
produced. If the bus sends to C before A (Figure 10a) then
no issues arise: the original message is multicast to both
parties, and then the feedback happens and is multicast.
If, on the other hand, the bus sends to A before C then
the feedback happens before the original message is sent to
C. The order in which C sees the feedback and the original
message is reversed (Figure 10b). This indicates that strong
causality and feedback require re-entrant processing
similar to recursive function calls. Such re-entrant
processing breaks the atomicity of invocations and also
needs significantly more channel flushing messages than the
non-re-entrant algorithm that has been presented. The
simplest form of the algorithm, Alg.1, would incorrectly output the
later message from B to C before the earlier message (Figure
10c). Without re-entrant processing there is a conflict
between strong causality and the sent after relation. The
later version of the algorithm, Alg.2, refines the
DeliverTheOldestMessage function to ensure that all incoming
messages are delayed (with the minimum necessary delay)
until after the previous output (Figure 10d). This implementation
obeys the time model, but (silently) prohibits non-delayed
feedbacks. At the theoretical level this obviates
the necessity for re-entrancy and prefers the sent after
relation to strong causality. At the engineering level, this
can be seen as a compromise between the ideal of the time
model and channel flushing costs.

A slightly more exhaustive account of the above is given
in the priority documents accompanying this Application.
CLAIMS

1. A complex computing system comprising a plurality 
of nodes connected to each other by channels along which 
timestamped data messages are sent and received, each 
timestamp being indicative, generation by generation, of its 
seniority acquired through its ancestors' arrival in the 
system and in any upstream nodes, and each node comprising: 

means for storing each input data message, 

means for determining the seniority of input data messages 
by progressive comparison of respective generations in the 
timestamps until the first distinction exists, 
means for delivering these messages for processing, 
means for applying a timestamp to each output message 
derived from such processing comprising the immediately 
ancestral message's timestamp augmented by a new generation 
seniority indicator consistent with the ordering, and 
means for outputting such ordered and timestamped messages. 

2. A complex computing system as claimed in Claim 1, 
wherein the delivery means is arranged to deliver messages 
in order according to which message has the most senior 
timestamp indicator.

3. A complex computing system as claimed in Claim 1
or 2, wherein for a data message received from outside the
system the initial timestamp indicator includes an
indication of the time of receipt of said data message at
the node.

4. A complex computing system as claimed in Claim 1,
2 or 3, wherein for a data message generated by a node of
the system the new generation seniority indicator of the
timestamp includes an indication of the place of said data
message in the ordered sequence of such messages at said
node.

5. A complex computing system as claimed in any 
preceding claim, wherein monotonic integers are utilised as 
said generation seniority indicators in the timestamps. 

6. A complex computing system as claimed in any
preceding claim, wherein the delivery means of a node
delivers data messages only once a message has been received
on each of the input channels of said node.

7. A complex computing system as claimed in any
preceding claim, wherein the delivery means of a node
delivers data messages only when at least one data message
received on each of the input channels of said node is
stored in the storage means.

8. A complex computing system as claimed in Claim 6
or 7, wherein each node is adapted to perform at least one
channel flushing routine triggerable by lack or paucity of
channel traffic.

9. A complex computing system as claimed in any
preceding claim, wherein all data messages caused by a first
data message anywhere in the system are delivered to a node
before any messages caused by a second data message, junior
to the first data message, are delivered to said node.

10. A method of ordering data messages within a
complex computing system comprising a plurality of nodes
connected to each other by channels along which data
messages are sent and received, the method comprising, for
each node, timestamping each message on arrival, queuing
messages until a message has been received on each input
channel to the node, and delivering the queued messages for
processing sequentially in accordance with their timestamps,
the message having the most senior timestamp being delivered
first, wherein the timestamping at each node is cumulative
so that the timestamp of a particular message indicates the
seniority acquired by that message, generation by generation,
and wherein the seniority of one message against
another is determined by the progressive comparison of
respective generations in the timestamps until the first
distinction exists.

11. A complex computing system comprising a plurality 
of nodes between which data messages are exchanged, wherein 
after the arrival at a node of a message, delivery of the 
message by the node is delayed until after the delivery and 
consequences of all more senior messages which affect the 
node.

12. A complex computing system as claimed in any one
of Claims 1 to 9 or 11, wherein the system is either a
distributed computing system, a symmetric multi-processor
computer, or a massively parallel processor computer.